Minimax optimal differentially private synthetic data for smooth queries

arXiv.org Machine Learning

Differentially private synthetic data enables the sharing and analysis of sensitive datasets while providing rigorous privacy guarantees for individual contributors. A central challenge is to achieve strong utility guarantees for meaningful downstream analysis. Many existing methods ensure uniform accuracy over broad query classes, such as all Lipschitz functions, but this level of generality often leads to suboptimal rates for statistics of practical interest. Since many common data analysis queries exhibit smoothness beyond what worst-case Lipschitz bounds capture, we ask whether exploiting this additional structure can yield improved utility. We study the problem of generating $(\varepsilon,\delta)$-differentially private synthetic data from a dataset of size $n$ supported on the hypercube $[-1,1]^d$, with utility guarantees uniformly for all smooth queries having bounded derivatives up to order $k$. We propose a polynomial-time algorithm that achieves a minimax error rate of $n^{-\min \{1, \frac{k}{d}\}}$, up to a $\log(n)$ factor. This characterization uncovers a phase transition at $k=d$. Our results generalize the Chebyshev moment matching framework of (Musco et al., 2025; Wang et al., 2016) and strictly improve the error rates for $k$-smooth queries established in (Wang et al., 2016). Moreover, we establish the first minimax lower bound for the utility of $(\varepsilon,\delta)$-differentially private synthetic data with respect to $k$-smooth queries, extending the Wasserstein lower bound for $\varepsilon$-differential privacy in (Boedihardjo et al., 2024).
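To make the phase transition at $k=d$ concrete, the following minimal sketch evaluates the stated rate $n^{-\min\{1, k/d\}}$ while ignoring the $\log(n)$ factor and all constants. The function name `smooth_query_rate` and the example values of $n$, $k$, and $d$ are hypothetical and do not come from the paper.

```python
def smooth_query_rate(n: int, k: int, d: int) -> float:
    """Evaluate n^(-min(1, k/d)), ignoring log factors and constants."""
    if n < 1 or k < 1 or d < 1:
        raise ValueError("n, k, and d must be positive integers")
    exponent = min(1.0, k / d)
    return n ** (-exponent)


# Illustrative values only: for d = 4 the rate improves as k grows
# and stops improving once k >= d.
for k in (1, 2, 4, 8):
    print(k, smooth_query_rate(10_000, k, 4))
```

Under these assumptions, increasing the smoothness order beyond the dimension leaves the exponent at 1, which is the plateau the abstract describes.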


Shocking video you MUST watch before voting for Mamdani: Here's what will become of NYC under him... and it's worse than everyone fears

Daily Mail - Science & tech



SCALAR: A Part-of-speech Tagger for Identifiers

arXiv.org Artificial Intelligence

The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger, to show how it improves upon other taggers' output for annotating identifiers. The code is available on GitHub.

Index Terms: Program comprehension, identifier naming, part-of-speech tagging, natural language processing, software maintenance, software evolution

I. INTRODUCTION

The identifiers developers create represent a significant amount of the information other developers must use to understand related code. Given that identifiers represent, on average, 70% of the characters in a code base [1], and developers spend more time reading code than writing it [2], [3], it is important for researchers to better understand how identifiers convey information, and how they can be improved to increase developer reading efficiency.


The study of short texts in digital politics: Document aggregation for topic modeling

arXiv.org Artificial Intelligence

Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches. There is ongoing debate on how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. In our study, we analyze one million tweets by U.S. state legislators from April 2016 to September 2020. We find that for documents aggregated at the account level, topics are more associated with individual states than when using individual tweets. This finding is replicated with Wikipedia pages aggregated by birth cities, showing how document definitions can impact topic modeling results.
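The aggregation step described above, merging short documents into larger ones by a natural unit such as the posting account, can be sketched as follows. This is a minimal illustration under assumptions: the function name `aggregate_documents`, the `(unit, text)` record layout, and the toy tweets are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def aggregate_documents(records):
    """Concatenate short texts that share the same unit key (e.g., an account)."""
    grouped = defaultdict(list)
    for unit, text in records:
        grouped[unit].append(text)
    # One aggregated document per unit, preserving the original text order.
    return {unit: " ".join(texts) for unit, texts in grouped.items()}


# Made-up example records: (account identifier, tweet text).
tweets = [
    ("account_a", "budget vote today"),
    ("account_b", "new school funding"),
    ("account_a", "town hall on roads"),
]
aggregated = aggregate_documents(tweets)
```

The aggregated documents, rather than the individual tweets, would then be passed to whichever topic model is being fit.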


Cooperative Decentralized Backdoor Attacks on Vertical Federated Learning

arXiv.org Artificial Intelligence

Federated learning (FL) is vulnerable to backdoor attacks, where adversaries alter model behavior on target classification labels by embedding triggers into data samples. While these attacks have received considerable attention in horizontal FL, they are less understood for vertical FL (VFL), where devices hold different features of the samples, and only the server holds the labels. In this work, we propose a novel backdoor attack on VFL which (i) does not rely on gradient information from the server and (ii) considers potential collusion among multiple adversaries for sample selection and trigger embedding. Our label inference model augments variational autoencoders with metric learning, which adversaries can train locally. A consensus process over the adversary graph topology determines which datapoints to poison. We further propose methods for trigger splitting across the adversaries, with an intensity-based implantation scheme skewing the server towards the trigger. Our convergence analysis reveals the impact of backdoor perturbations on VFL indicated by a stationarity gap for the trained model, which we verify empirically as well. We conduct experiments comparing our attack with recent backdoor VFL approaches, finding that ours obtains significantly higher success rates for the same main task performance despite not using server information. Additionally, our results verify the impact of collusion on attack performance.


Addressing Small and Imbalanced Medical Image Datasets Using Generative Models: A Comparative Study of DDPM and PGGANs with Random and Greedy K Sampling

arXiv.org Artificial Intelligence

The development of accurate medical image classification models is often constrained by privacy concerns and data scarcity for certain conditions, leading to small and imbalanced datasets. To address these limitations, this study explores the use of generative models, such as Denoising Diffusion Probabilistic Models (DDPM) and Progressive Growing Generative Adversarial Networks (PGGANs), for dataset augmentation. The research introduces a framework to assess the impact of synthetic images generated by DDPM and PGGANs on the performance of four models: a custom CNN, Untrained VGG16, Pretrained VGG16, and Pretrained ResNet50. Experiments were conducted using Random Sampling and Greedy K Sampling to create small, imbalanced datasets. The synthetic images were evaluated using Frechet Inception Distance (FID) and compared to original datasets through classification metrics. The results show that DDPM consistently generated more realistic images with lower FID scores and significantly outperformed PGGANs in improving classification metrics across all models and datasets. Incorporating DDPM-generated images into the original datasets increased accuracy by up to 6%, enhancing model robustness and stability, particularly in imbalanced scenarios. Random Sampling demonstrated superior stability, while Greedy K Sampling offered diversity at the cost of higher FID scores. This study highlights the efficacy of DDPM in augmenting small, imbalanced medical image datasets, improving model performance by balancing the dataset and expanding its size.


The Effects of Hallucinations in Synthetic Training Data for Relation Extraction

arXiv.org Artificial Intelligence

Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact on relation extraction remains underexplored. In this paper, we examine the effects of hallucinations on the performance of relation extraction on the document and sentence levels. Our empirical study reveals that hallucinations considerably compromise the ability of models to extract relations from text, with recall reductions between 19.1% and 39.2%. We identify that relevant hallucinations impair the model's performance, while irrelevant hallucinations have a minimal impact. Additionally, we develop methods for the detection of hallucinations to improve data quality and model performance. Our approaches successfully classify texts as either 'hallucinated' or 'clean,' achieving high F1-scores of 83.8% and 92.2%. These methods not only assist in removing hallucinations but also help in estimating their prevalence within datasets, which is crucial for selecting high-quality data. Overall, our work confirms the profound impact of relevant hallucinations on the effectiveness of relation extraction models.


A Game Designer Just Hid a Gold Trophy in the Woods for a Real-Life Treasure Hunt. It Starts Now

WIRED

Gold Treasure Worth a Fortune Was Hidden in a Forest. For years, Jason Rohrer put out bizarre, beloved video games. Now, with Project Skydrop, he launches the real-world treasure hunt of his dreams. The muddy trail levels out and we stop to catch our breath. Which is good, because hiking with my eyes covered has been a pain in the ass. A voice says: "You can take your blindfold off now." I squint as I get my bearings. Then, after a bit more hiking and some bushwhacking, I finally see it. The thing no one is supposed to know the location of, at least for another few weeks. I have to fight a lizard-brain instinct to reach for it.


Challenging Fairness: A Comprehensive Exploration of Bias in LLM-Based Recommendations

arXiv.org Artificial Intelligence

Large Language Model (LLM)-based recommendation systems provide more comprehensive recommendations than traditional systems by deeply analyzing content and user behavior. However, these systems often exhibit biases, favoring mainstream content while marginalizing non-traditional options due to skewed training data. This study investigates the intricate relationship between bias and LLM-based recommendation systems, with a focus on music, song, and book recommendations across diverse demographic and cultural groups. Through a comprehensive analysis conducted over different LLM-models, this paper evaluates the impact of bias on recommendation outcomes. Our findings reveal that bias is so deeply ingrained within these systems that even a simpler intervention like prompt engineering can significantly reduce bias, underscoring the pervasive nature of the issue. Moreover, factors like intersecting identities and contextual information, such as socioeconomic status, further amplify these biases, demonstrating the complexity and depth of the challenges faced in creating fair recommendations across different groups.


A Jellyfish Cyborg: Exploiting Natural Embodied Intelligence as Soft Robots

arXiv.org Artificial Intelligence

In the advanced field of bio-inspired robotics, the emergence of cyborgs represents the successful integration of engineering and biological systems. Building on previous research that showed how electrical stimuli could initiate and speed up a jellyfish's movement, this study presents a groundbreaking approach that explores how the natural embodied intelligence of the animal can be harnessed to address pivotal challenges such as spontaneous exploration, navigation in various environments, control of whole-body motion, and real-time predictions of behavior. We have developed a comprehensive data acquisition system and a unique setup for stimulating jellyfish, allowing for a detailed study of their movements. Through careful analysis of both spontaneous behaviors and behaviors induced by targeted stimulation, we have identified subtle differences between natural and induced motion patterns. By using a machine learning method called physical reservoir computing, we have successfully shown that future behaviors can be accurately predicted by directly measuring the jellyfish's body shape when the stimuli align with the animal's natural dynamics. Our findings also reveal significant advancements in motion control and real-time prediction capabilities of jellyfish cyborgs. In summary, this research provides a comprehensive roadmap for optimizing the capabilities of jellyfish cyborgs, with potential implications in marine reconnaissance and sustainable ecological interventions.